Re-estimation of Lexical Parameters for Treebank PCFGs

نویسنده

  • Tejaswini Deoskar
چکیده

We present procedures which pool lexical information estimated from unlabeled data via the Inside-Outside algorithm, with lexical information from a treebank PCFG. The procedures produce substantial improvements (up to 31.6% error reduction) on the task of determining subcategorization frames of novel verbs, relative to a smoothed Penn Treebank-trained PCFG. Even with relatively small quantities of unlabeled training data, the re-estimated models show promising improvements in labeled bracketing f-scores on Wall Street Journal parsing, and substantial benefit in acquiring the subcategorization preferences of low-frequency verbs.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Corpus Induction of Lexicons for Treebank PCFGs by Inside-Outside Estimation and Frequency Transformations

We describe procedures which pool lexical information from a treebank with frequency information estimated from an unannotated corpus with the insideoutside algorithm. PCFG parameters for non-lexical productions are obtained purely from the treebank. The procedures produce substantial improvements (upto 20.34%) on the task of determining valences of tokens of novel verbs, relative to a smoothed...

متن کامل

Induction of Treebank-Aligned Lexical Resources

By ‘treebank-aligned lexical resources’ we mean ones where there is a systematic correspondence between the lexical resource and treebank syntactic resources. For instance, the lexicon resource contains features representing the subcategorization frames of verbs, which correspond to structural configurations that the verb occurs in, in a treebank. Given such an alignment, a treebank can be comp...

متن کامل

Three-Dimensional Parametrization for Parsing Morphologically Rich Languages

Current parameters of accurate unlexicalized parsers based on Probabilistic ContextFree Grammars (PCFGs) form a twodimensional grid in which rewrite events are conditioned on both horizontal (headoutward) and vertical (parental) histories. In Semitic languages, where arguments may move around rather freely and phrasestructures are often shallow, there are additional morphological factors that g...

متن کامل

Stochastic Analysis of Lexical and Semantic Enhanced Structural Language Model

In this paper, we present a directed Markov random field model that integrates trigram models, structural language models (SLM) and probabilistic latent semantic analysis (PLSA) for the purpose of statistical language modeling. The SLM is essentially a generalization of shift-reduce probabilistic push-down automata thus more complex and powerful than probabilistic context free grammars (PCFGs)....

متن کامل

Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations

We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging. We empirically analyze the diversity between two representative corpora, i.e. Penn Chinese Treebank (CTB) and PKU’s People’s Daily (PPD), on manually mapped data, and show that their linguistic annotations are systematically different and highly compatible. The analysis is f...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008